library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(wordcloud2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
Yuxuan Du: yd2739
Tara Zhan: fz2377
Yaduo Wang: yw4199
Chenshuo Pan: cp3384
Wenxin Tian: wt2369
Amidst the burgeoning gaming industry, popular belief posits genres such as shooters, sports, and racing as perennial favorites. As gamers, our team suspects there is a “success formula” behind successful video games. Our team’s avid gaming experience through STEAM has piqued our interest in this hypothesis. Our objective is to dissect and analyze the veracity of such claims and identify characteristics that signal the popularity of STEAM games.
https://www.kaggle.com/datasets/fronkongames/steam-games-dataset
In the process of preparing our data for analysis, we have established a set of inclusion and variable selection criteria to ensure the relevance of our study. These criteria are vital to the successful execution of our data science investigation.
One of the primary inclusion criteria for our analysis is the exclusion of free games from our dataset. This decision is underpinned by the understanding that in free-to-play games, in-app purchases often serve as viable alternatives to traditional game sales. However, due to the unavailability of comprehensive data pertaining to these in-app purchases, their inclusion in our analysis would be both impractical and imprecise.
In addition, we have also opted to remove certain variables from our dataset that are non-essential for our analysis.
We eliminate columns containing URLs as they do not contribute meaningful information to our analytical objectives.
Variables associated with game developers and publishers will be filtered out for two primary reasons. First, we posit that predicting or quantifying the effect of developer names would be challenging, if not infeasible. Second, we hypothesize that brand effects can be more accurately captured through a combination of game features and keywords; for example, there is a tag for independent games, which is the complement of games made by large companies.
Furthermore, we will exclude the variables “Playtime in Current Two Weeks” as it may not be representative for older games, and “Average Playtime Forever” in favor of “Median Playtime Forever” due to potential distortions caused by AFK (Away From Keyboard) gamers.
It is worth noting that we are cognizant of the need to address potential data leakage issues and redundant features when fitting our model. These aspects will be thoroughly addressed in subsequent stages of our analysis to ensure the robustness of our findings.
df = read.csv("DATA/games.csv")|>
janitor::clean_names()|>
subset(select = -c(dlc_count, about_the_game, reviews, header_image, website, support_url, support_email, metacritic_score, metacritic_url, notes, developers, publishers, screenshots, movies, score_rank, average_playtime_two_weeks, median_playtime_two_weeks, average_playtime_forever, peak_ccu, full_audio_languages))|>
subset(price > 0)
#Change the format of release date.
df = df|>
mutate(release_date = as.Date(release_date, format = "%b %d, %Y"))
In our project, the exploration of key features of popular games forms a central objective. To establish a precise definition of “popular games” in the context of our analysis, we will primarily adhere to the standard set by Steam, a prominent gaming platform [https://www.reddit.com/r/Steam/comments/ivz45n/what_does_the_steam_ratings_like_very_negative_or/]. According to Steam’s rating system, we will categorize games as “popular” if they fall into the following rating categories: Mostly Positive, Positive, Very Positive, or Overwhelmingly Positive. This approach ensures that we consider user sentiment as a crucial aspect of popularity.
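As a rough illustration, the rating labels can be mapped from the positive-review share. The thresholds below are our own reading of community documentation (such as the linked Reddit thread), not an official Steam specification, and Steam's actual labels also depend on review counts:

```r
# Hypothetical mapping from positive-review share to a Steam-style label.
# Thresholds are approximate community lore, not an official API.
steam_label <- function(share) {
  if (share >= 0.95) "Overwhelmingly/Very Positive"
  else if (share >= 0.80) "Very Positive / Positive"
  else if (share >= 0.70) "Mostly Positive"
  else "Mixed or worse"
}
steam_label(0.72)  # "Mostly Positive"
```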
However, we recognize that popularity encompasses more than just ratings; it also pertains to the volume of users who purchase and engage with a game. To account for this broader perspective of popularity, we have incorporated the following criteria:
Estimated Owners: We will exclude games with an estimated owner count in the ranges 0-0 and 0-20,000. This criterion ensures that we focus on games with a substantial user base.
Median Playtime: Games selected for analysis will have a median playtime exceeding 120 minutes. This threshold corresponds to the latest time frame within which Steam allows game refunds. This criterion further refines our definition of popular games by considering active engagement and retention, which are indicative of a game’s enduring appeal.
By combining these rating, ownership, and playtime criteria, we aim to create a comprehensive definition of “popular games” that accounts for both user sentiment and the size of the player base, enabling us to explore key features that contribute to a game’s popularity in a holistic manner.
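The combined rule can be sketched as a single predicate; this helper is our own illustration and simply mirrors the filters we apply below:

```r
# Our popularity rule in one place: enough reviews, >70% positive,
# a non-trivial owner estimate, and median playtime past the refund window.
is_popular <- function(positive, negative, estimated_owners, median_playtime) {
  total <- positive + negative
  total > 10 &&
    positive / total > 0.7 &&
    !(estimated_owners %in% c("0 - 0", "0 - 20000")) &&
    median_playtime > 120
}
is_popular(90, 10, "20000 - 50000", 300)  # TRUE
is_popular(90, 10, "0 - 20000", 300)      # FALSE
```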
df_popular = df|>
subset((positive+negative) > 10)|>
subset(positive/(positive+negative) > 0.7)|>
subset(estimated_owners != "0 - 20000")|>
subset(estimated_owners != "0 - 0")|>
subset(median_playtime_forever > 120)
df_unpopular = anti_join(df, df_popular, by="app_id")
summary(pull(df_popular, price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.49 5.99 12.99 15.21 19.99 99.99
summary(pull(df_unpopular, price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.29 2.59 4.99 8.14 9.99 999.00
Based on the summary tables, it is evident that popular games consistently exhibit higher prices than their less popular counterparts across all quartiles. This trend can be logically attributed to several key factors that shape pricing dynamics within the gaming industry.
Game Mechanics: Popular games often incorporate complex and engaging game mechanics that elevate the overall gaming experience. Developing and refining such intricate mechanics requires investment in design, testing, and optimization, all of which contribute to higher production costs and are reflected in the final price.
Animation Effects: Visually captivating animation effects are a hallmark of popular games. These effects enhance immersion and player engagement, but their creation requires the expertise of skilled artists and animators, thereby increasing development expenses.
Game Optimization: Popular games prioritize performance optimization to ensure smooth gameplay across a wide range of hardware configurations (although some popular games still ship with plenty of bugs). Achieving such optimization demands rigorous coding and testing efforts, which in turn incur additional costs.
We also used a statistical test to check whether the two price distributions differ.
df_unpopular = df_unpopular|>subset(name != "Ascent Free-Roaming VR Experience")
combined_data = rbind(data.frame(dataset = "unpopular", price = log(df_unpopular$price)),
data.frame(dataset = "popular", price = log(df_popular$price)))
ggplot(data = combined_data, aes(x = price, fill = dataset)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.75)+
facet_grid(~dataset)
The two distributions look approximately normal, so we decided to use a t-test to compare the two groups’ log-transformed prices.
H0: the two groups have the same mean log price.
H1: the two groups have different mean log prices.
result <- t.test(log(df_popular$price), log(df_unpopular$price))
result
##
## Welch Two Sample t-test
##
## data: log(df_popular$price) and log(df_unpopular$price)
## t = 52.188, df = 6584.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7043446 0.7593242
## sample estimates:
## mean of x mean of y
## 2.343884 1.612050
The result shows a p-value far below 0.05, so we reject the null hypothesis and conclude that the two groups have different mean log prices.
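As a robustness check, a nonparametric test such as the Wilcoxon rank-sum test could be run alongside the t-test. The sketch below uses simulated log prices with the observed group means (the spreads are assumed), not the real dataset:

```r
# Simulated sketch only: group means taken from the t-test output above,
# standard deviations are assumptions for illustration.
set.seed(42)
log_price_popular   <- rnorm(1000, mean = 2.34, sd = 0.8)
log_price_unpopular <- rnorm(1000, mean = 1.61, sd = 0.9)
wilcox.test(log_price_popular, log_price_unpopular)$p.value < 0.05  # TRUE
```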
# We filter out Ascent Free-Roaming VR Experience since its price is $1,000, which would make it hard to see any trend in our plot.
df|>
subset(name != "Ascent Free-Roaming VR Experience")|>
separate_wider_delim(cols = estimated_owners, delim = " - ", names = c("estimate_low", "estimate_high"))|>
mutate(estimate_low = as.numeric(estimate_low))|>
mutate(estimate_high = as.numeric(estimate_high))|>
mutate(estimate_mean = (estimate_high + estimate_low)/2)|>
mutate(estimate_mean = factor(estimate_mean))|>
ggplot(aes(x = estimate_mean, y = price))+
geom_boxplot()+
coord_flip()
From the price distribution plot of games categorized by estimated owner counts, a notable pattern emerges: games with more estimated owners tend to exhibit relatively higher prices. However, an intriguing observation is that once a certain threshold of estimated owners is reached, prices begin to decline.
### Overall frequency of keywords (genres, categories, tags)
Here, we explore the trends of keywords across genres, categories, and tags.
categories_freq = df|>
separate_rows(categories, sep = ",")|>
group_by(categories)|>
summarise(n_obs = n())|>
arrange(desc(n_obs))|>
ggplot(aes(x = reorder(categories, +n_obs), y = n_obs))+
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
coord_flip()
categories_freq
From the bar plot, it is evident that single-player games have the highest frequency in our dataset. This is reasonable: multiplayer games require far more optimization, so single-player games are the natural choice for independent developers. VR-related categories appear infrequently because the equipment is expensive and the technology is relatively new.
genre_freq = df|>
separate_rows(genres, sep = ",")|>
group_by(genres)|>
summarise(n_obs = n())|>
arrange(desc(n_obs))|>
ggplot(aes(x = reorder(genres, +n_obs), y = n_obs))+
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
coord_flip()
genre_freq
In the genre bar plot, apart from the most frequent keyword, Indie, the other keywords are ones commonly associated with games: Sports, Racing, Action, and Adventure cover most game genres.
tags_freq = df|>
separate_rows(tags, sep = ",")|>
group_by(tags)|>
summarise(n_obs = n())|>
subset(n_obs >= 5000)|>
arrange(desc(n_obs))|>
ggplot(aes(x = reorder(tags, +n_obs), y = n_obs))+
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
coord_flip()
tags_freq
As there are too many tags on Steam, we only show the most frequent ones. Interestingly, we found there are more puzzle games than RPGs; this might be because the development cost of a puzzle game is lower than that of an RPG.
It might be informative to visualize how the mix of game genres has changed across release years, since WoW might have been the game of the 2000s, while a game like Overwatch better fits the 2020s.
genre_freq_year = df|>
mutate(year = year(release_date))|>
separate_rows(genres, sep = ",")|>
group_by(year, genres)|>
summarise(n_obs = n())|>
group_by(year)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
genre_freq_year_total = df|>
mutate(year = year(release_date))|>
separate_rows(genres, sep = ",")|>
group_by(year)|>
summarise(n_obs_total = n())
genre_freq_year_total|>
knitr::kable()
| year | n_obs_total |
|---|---|
| 1997 | 3 |
| 1998 | 1 |
| 1999 | 2 |
| 2000 | 2 |
| 2001 | 7 |
| 2002 | 2 |
| 2003 | 7 |
| 2004 | 7 |
| 2005 | 10 |
| 2006 | 79 |
| 2007 | 118 |
| 2008 | 230 |
| 2009 | 477 |
| 2010 | 419 |
| 2011 | 477 |
| 2012 | 659 |
| 2013 | 1011 |
| 2014 | 3312 |
| 2015 | 6109 |
| 2016 | 9853 |
| 2017 | 14823 |
| 2018 | 19051 |
| 2019 | 18579 |
| 2020 | 22777 |
| 2021 | 27694 |
| 2022 | 30735 |
| 2023 | 19446 |
| NA | 277 |
We can see from the table that there are very few games on Steam before 2006; genre frequencies in those years would be strongly affected by randomness, so we choose to only plot the trend after 2005.
# We limit the plot to years after 2005: very few games were released on Steam before 2006, so genre frequencies in those years are not representative.
left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
subset(year > 2005)|>
mutate(genre_ratio = n_obs/n_obs_total)|>
ungroup()|>
plot_ly(x = ~year, y = ~genre_ratio, color = ~genres, type = "scatter", mode = "lines+markers", colors = "viridis")
popular_genres_bar_plot = left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
subset(year > 2005)|>
mutate(genre_ratio = n_obs/n_obs_total)|>
ggplot(aes(x = year, y = genre_ratio, fill = genres)) +
geom_col(position = 'stack', width = 0.6)+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1))
popular_genres_bar_plot
We found some interesting trend from the line plot:
Rise of Indie Games: A prominent observation is the surge in the frequency of the “Indie” genre, particularly after 2012. This is likely attributable to the wider availability of personal computers, which has made it easier for independent developers to create and publish their games.
Shift in Genre Popularity: The line plot also highlights a noteworthy trend in the Action genre. While it remains dominant, it has experienced a gradual decline in frequency over the years. This shift may be influenced by changing player preferences and the emergence of diverse gaming experiences in other genres. It suggests that players are increasingly seeking a wider variety of gaming experiences beyond traditional action games.
Casual Games on the Rise: The period from 2012 to 2023 shows an increase in the frequency of “Casual” games. This may indicate a growing market for games that offer accessible and relaxed gameplay; it might also indicate that players are craving relaxation rather than stimulation.
Early Access Strategy: The “Early Access” games on Steam experienced a notable increase in their frequency. This approach provides independent game developers with the opportunity to release their games in a partially completed state, allowing them to gather player feedback, iterate on their games, and secure investment more easily. The popularity of this strategy underscores its effectiveness in the independent game development ecosystem.
A word cloud helps visualize the frequency of keywords in different groups. Here, we create separate word clouds for popular and unpopular games.
overwhelm_popular_wordcloud = df|>
subset((positive+negative) > 500)|>
subset(positive/(positive+negative) > 0.95)|>
separate_rows(tags, sep = ",")|>
group_by(tags)|>
summarise(n_obs = n())|>
wordcloud2()
popular_tags_wordcloud = df_popular|>
separate_rows(tags, sep = ",")|>
group_by(tags)|>
summarise(n_obs = n())|>
wordcloud2()
unpopular_tags_wordcloud = df_unpopular|>
separate_rows(tags, sep = ",")|>
group_by(tags)|>
summarise(n_obs = n())|>
wordcloud2()
overwhelm_popular_wordcloud
popular_tags_wordcloud
unpopular_tags_wordcloud
The distinctions revealed among the three word clouds are insightful and provide valuable context for understanding the characteristics that differentiate popular and unpopular games:
Prevalence of Adventure Games in Unpopular Games: Adventure games appear to be more prevalent among unpopular games and are less common in popular games. This discrepancy may stem from the accessibility of game development tools like RPG Maker, which enables the creation of adventure games with relative ease. Consequently, the market may be flooded with a range of adventure games of varying quality, leading to their uneven representation among unpopular titles.
Emphasis on Difficulty in Popular Games: The word cloud for popular games prominently features the tag “difficult.” This aligns with the reality that some of the most highly regarded games in the industry, such as Elden Ring and Sekiro: Shadows Die Twice, are renowned for their challenging gameplay. The presence of this tag underscores the appeal of a high level of difficulty in certain popular titles.
Ratio of 2D to 3D Games: Notably, the word cloud suggests that the ratio of 2D games to 3D games is higher in popular games compared to unpopular ones. This observation may be attributed to resource constraints faced by smaller game development studios. In many cases, these studios may prioritize the development of engaging plotlines and intricate game mechanics over the creation of elaborate 3D environments, resulting in a higher prevalence of 2D games among popular titles.
Popularity of Pixel Graphics: The higher frequency of the “pixel graphics” tag in popular games is an intriguing finding. Pixel graphics are favored by independent developers as they allow a strong focus on game mechanics. This aligns with the success stories of games like Terraria and Minecraft, which have leveraged pixel graphics to create immersive and engaging gameplay experiences. The appeal of pixel graphics in popular games reflects the importance of gameplay and mechanics in driving their popularity.
Since it is hard for models to analyze keywords as raw strings, we one-hot encoded categories, genres, and tags, which makes the dataset easier for models to analyze.
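The idea behind the encoding can be shown on toy data in base R (the actual pipeline below uses tidyr’s pivot_wider on the real keyword columns):

```r
# Toy one-hot encoding of comma-separated keyword strings.
keywords <- c("Action,Indie", "Indie,Casual")
tokens   <- strsplit(keywords, ",")
vocab    <- sort(unique(unlist(tokens)))
one_hot  <- t(sapply(tokens, function(k) as.integer(vocab %in% k)))
colnames(one_hot) <- vocab
one_hot  # rows = games; columns = Action, Casual, Indie
```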
During our data preprocessing phase, we identified certain tags that had the potential to introduce data leakage into our dataset. These tags were deemed inappropriate for inclusion due to their subjective and conclusive nature, which makes them difficult for a game producer to ascertain in advance: Masterpiece, Great Soundtrack, Addictive, benchmark, classic. By excluding these tags from our dataset, we can provide a model that helps game makers gauge a game’s potential popularity.
df_concat = df|>
mutate(keywords = paste(df$categories, df$genres, df$tags, sep = ","))|>
subset(select = -c(categories, genres, tags))
keywords_df = df_concat|>
subset(select = c(app_id, keywords))|>
separate_rows(keywords, sep = ",")|>
distinct(app_id, keywords, .keep_all = TRUE)|>
mutate(value = 1)|>
subset(keywords != "")|>
pivot_wider(names_from = keywords, values_from = value, values_fill = 0)
one_hot_encoded_df = left_join(df_concat, keywords_df, by = "app_id")
popular_encoded_df = one_hot_encoded_df|>
subset((positive+negative) > 10)|>
subset(positive/(positive+negative) > 0.7)|>
subset(estimated_owners != "0 - 20000")|>
subset(estimated_owners != "0 - 0")|>
subset(median_playtime_forever > 120)|>
mutate(popular = 1)
unpopular_encoded_df = anti_join(one_hot_encoded_df, popular_encoded_df, by="app_id")|>
mutate(popular = 0)
encoded_with_label_df = rbind(popular_encoded_df, unpopular_encoded_df)|>
janitor::clean_names()|>
subset(select = -c(positive, negative, median_playtime_forever, estimated_owners, recommendations, user_score, supported_languages, keywords, addictive, masterpiece, great_soundtrack, benchmark, classic))|>
mutate(windows = as.integer(as.logical(windows)))|>
mutate(mac = as.integer(as.logical(mac)))|>
mutate(linux = as.integer(as.logical(linux)))
dataset_without_id_name = encoded_with_label_df|>
subset(select = -c(app_id, name, release_date))|>
drop_na()
There are some redundant features in our dataset; we want to find those variables and remove some of them.
# This function helps us find tags that are potential subsets of other tags: take the dot product of two one-hot encoded columns and compare it with each column's sum. If they are equal and larger than 0, one tag's games are contained in the other's.
find_col_pairs = function(df) {
n = ncol(df)
result = character(0)
for (i in 1:(n-1)) {
for (j in (i+1):n) {
col_i = df[, i]
col_j = df[, j]
dot_product = sum(col_i * col_j)
col_i_sum = sum(col_i)
col_j_sum = sum(col_j)
if (col_i_sum == dot_product & dot_product != 0 & col_i_sum != 0) {
result = c(result, paste(colnames(df)[i], colnames(df)[j], sep = "-"))
}
if (col_j_sum == dot_product & dot_product != 0 & col_j_sum != 0) {
result = c(result, paste(colnames(df)[j], colnames(df)[i], sep = "-"))
}
}
}
return(result)
}
subgroup_features = find_col_pairs(dataset_without_id_name)
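A quick sanity check of the dot-product subset test on toy vectors: if sum(a * b) equals sum(a) and is positive, every row flagged in a is also flagged in b.

```r
# Toy 0/1 tag columns: `a` could be a niche tag, `b` a broad one like windows.
a <- c(1, 0, 1, 0)
b <- c(1, 1, 1, 0)
sum(a * b) == sum(a)  # TRUE: a is a subset of b
sum(a * b) == sum(b)  # FALSE: b is not a subset of a
```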
From the result, most features are sub-groups of windows, which is expected since most games run on Windows. We decided to remove this feature since it is not informative to our model. We also found that the tags batman, fox, birds, football_american, tile_matching, and tracked_motion_controller_support are present only once, and coding appears only twice; to make our model more generalizable, we decided to drop these columns.
We identified redundant features such as ‘Multi_player’ and ‘multiplayer,’ as well as ‘Single_player’ and ‘singleplayer.’ Although they are not completely collinear, it was evident that they provided similar information. To enhance model interpretability, we decided to remove one of each redundant pair.
dataset_without_id_name = dataset_without_id_name|>
subset(select = -c(windows, multiplayer, birds, football_american, fox, batman, coding, tile_matching,tracked_motion_controller_support, singleplayer))
Here, we choose lasso regression since it performs feature selection. We use 5-fold cross-validation to select lambda. We first separate the test set and train set to prevent possible data leakage.
# First we shuffle the rows and split train and test sets, then we split features and target.
dataset_without_id_name= dataset_without_id_name[sample(1:nrow(dataset_without_id_name)), ]
dataset_without_id_name$id = 1:nrow(dataset_without_id_name)
train = dataset_without_id_name |>
sample_frac(0.70)
test = anti_join(dataset_without_id_name, train, by = 'id')
train = train|>
subset(select = -c(id))
test = test|>
subset(select = -c(id))
train_x = train|>
subset(select = -c(popular))
train_y = train|>
pull(popular)
test_x = test|>
subset(select = -c(popular))
test_y = test|>
pull(popular)
set.seed(1234)
foldid = sample(1:5, size = nrow(train_x), replace = TRUE)
lambda = 10^(seq(2, -5, -0.1))
lasso_fit = glmnet(
x = as.matrix(train_x),
y = train_y,
lambda = lambda,
alpha=1,
family = "binomial"
)
lasso_cv = cv.glmnet(
x = as.matrix(train_x),
y = train_y,
lambda = lambda,
foldid = foldid,
alpha=1,
family = "binomial"
)
lambda_opt = lasso_cv$lambda.min
broom::tidy(lasso_fit) |>
select(term, lambda, estimate) |>
complete(term, lambda, fill = list(estimate = 0) ) |>
filter(term != "(Intercept)") |>
ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) +
geom_path() +
geom_vline(xintercept = log(lambda_opt, 10), color = "blue", linewidth = 1.2) +
theme(legend.position = "none")
The vertical blue line marks the optimal lambda selected by 5-fold cross-validation.
result = predict(lasso_fit, s = lambda_opt, newx = as.matrix(test_x), type = 'response')
prediction_obj = prediction(result, test_y)
performance_obj = performance(prediction_obj, "tpr", "fpr")
auc_value = performance(prediction_obj, "auc")@y.values[[1]]
plot(performance_obj, main = "ROC Curve")
text(0.6, 0.2, paste("AUC =", round(auc_value, 2)), col = "red")
result = as.data.frame(result)|>
mutate(prediction = ifelse(s1 > 0.5, 1, 0))|>
mutate(actual = test_y)|>
mutate(difference = ifelse(prediction != actual, 1, 0))
acc = (nrow(result) - sum(pull(result, difference)))/nrow(result)
result = result|>
mutate(actual = factor(actual, levels = c(1, 0)))|>
mutate(prediction = factor(prediction, levels = c(1, 0)))
confusionMatrix(data=pull(result, prediction), reference = pull(result, actual))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 576 308
## 0 1126 16979
##
## Accuracy : 0.9245
## 95% CI : (0.9206, 0.9282)
## No Information Rate : 0.9104
## P-Value [Acc > NIR] : 1.648e-12
##
## Kappa : 0.4093
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33843
## Specificity : 0.98218
## Pos Pred Value : 0.65158
## Neg Pred Value : 0.93781
## Prevalence : 0.08963
## Detection Rate : 0.03033
## Detection Prevalence : 0.04655
## Balanced Accuracy : 0.66030
##
## 'Positive' Class : 1
##
Upon examining the confusion matrix, it becomes evident that our model struggles to correctly classify a substantial number of popular games. However, a noteworthy observation is that the rate of false positives is within an acceptable range. This observation suggests that there may be discernible patterns among unpopular games that our model can capture effectively. As a result, this model holds promise in providing valuable insights for identifying games that are less likely to achieve popularity.
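The headline metrics can be recomputed directly from the printed confusion matrix counts:

```r
# Counts read off the confusion matrix above (positive class = 1).
tp <- 576; fp <- 308; fn <- 1126; tn <- 16979
sensitivity <- tp / (tp + fn)                    # ~0.338
specificity <- tn / (tn + fp)                    # ~0.982
accuracy    <- (tp + tn) / (tp + fp + fn + tn)   # ~0.924
c(sensitivity, specificity, accuracy)
```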
Furthermore, our analysis reveals that the model’s performance, as indicated by the p-value, is statistically significant compared to the scenario of having no information. This suggests that our model does contain valuable information that contributes to its predictive capabilities. The significance of the p-value underscores the utility of the model in making informed predictions and decisions.
It is important to note that the model’s sensitivity in recognizing popular games is relatively low. One plausible explanation is the imbalanced ratio of popular to unpopular games within the dataset: this skewed distribution makes it hard for the model to learn generalized patterns for popular games, reducing sensitivity. Down-sampling the majority class might help, though it is unclear whether such down-sampling would cause information loss.
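A down-sampling pass could look like the sketch below (toy data; in our setting the real frame would be the training set with its popular column):

```r
# Down-sample the majority (unpopular) class to match the minority count.
set.seed(1)
toy_train <- data.frame(popular = c(rep(1, 20), rep(0, 180)), x = rnorm(200))
pos <- toy_train[toy_train$popular == 1, ]
neg <- toy_train[toy_train$popular == 0, ]
neg_down <- neg[sample(nrow(neg), nrow(pos)), ]
train_balanced <- rbind(pos, neg_down)
table(train_balanced$popular)  # 20 of each class
```

caret also provides a downSample() helper that implements the same idea.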